Existing studies on multimodal sentiment analysis rely heavily on the textual modality and inevitably induce spurious correlations between textual words and sentiment labels, which greatly hinders the generalization ability of the model. To address this problem, we define the task of out-of-distribution (OOD) multimodal sentiment analysis, which aims to estimate and mitigate the bad effect of the textual modality for strong generalization. To this end, we embrace causal inference and inspect the causal relationships via a causal graph. From the graph, we find that the spurious correlations are attributed to the direct effect of the textual modality on the model prediction, while the indirect effect, which takes multimodal semantics into account, is more reliable. Inspired by this, we design a model-agnostic counterfactual framework for multimodal sentiment analysis, which captures the direct effect of the textual modality with an extra text model and estimates the indirect effect with a multimodal model. During inference, we first estimate the direct effect via counterfactual inference and then subtract it from the total effect of all modalities to obtain the indirect effect for reliable prediction. Extensive experiments demonstrate the superior effectiveness and generalization ability of our proposed framework.
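A minimal, hypothetical sketch of the inference-time subtraction described above; the fusion of logits, the scaling factor, and the toy inputs are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def debiased_prediction(multimodal_logits, text_only_logits, alpha=1.0):
    """Counterfactual-style inference (sketch): the text-only branch
    approximates the direct effect of the textual modality, which is
    subtracted from the total effect of all modalities to keep the
    (more reliable) indirect, multimodal effect."""
    total_effect = multimodal_logits        # prediction with all modalities
    direct_effect = text_only_logits        # counterfactual: text modality alone
    indirect_effect = total_effect - alpha * direct_effect
    return softmax(indirect_effect)

# toy example with 3 sentiment classes
mm = np.array([[2.0, 0.5, -1.0]])   # multimodal model logits
txt = np.array([[1.5, 0.2, -0.8]])  # extra text-only model logits
print(debiased_prediction(mm, txt))
```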
Text response generation for multimodal task-oriented dialog systems, which aims to generate a proper text response given a multimodal context, is an essential yet challenging task. Although existing efforts have achieved compelling success, they still suffer from two key limitations: 1) overlooking the benefit of generative pre-training, and 2) ignoring the textual-context-related knowledge. To address these limitations, we propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD), which consists of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation. Specifically, the dual knowledge selection component aims to select the relevant knowledge according to both the textual and visual modalities of the given context. Thereafter, the dual knowledge-enhanced context learning component seamlessly integrates the selected knowledge into multimodal context learning from both global and local perspectives, where the cross-modal semantic relation is also explored. Moreover, the knowledge-enhanced response generation component comprises a revised BART decoder, where an additional dot-product knowledge-decoder attention sub-layer is introduced to explicitly utilize the knowledge to advance text response generation. Extensive experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.
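A minimal sketch of a dot-product knowledge attention step like the sub-layer described above; the single-head form, dimensions, and random inputs are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_attention(decoder_states, knowledge_embeds):
    """Scaled dot-product attention from decoder states (queries) over the
    selected knowledge entries (keys/values); the output is what a
    knowledge-decoder attention sub-layer would pass onward."""
    d = decoder_states.shape[-1]
    scores = decoder_states @ knowledge_embeds.T / np.sqrt(d)  # (T, K)
    weights = softmax(scores, axis=-1)
    return weights @ knowledge_embeds                          # (T, d)

T, K, d = 4, 6, 8                     # decoder steps, knowledge entries, hidden size
rng = np.random.default_rng(0)
out = knowledge_attention(rng.normal(size=(T, d)), rng.normal(size=(K, d)))
print(out.shape)  # (4, 8)
```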
Visual attention in visual question answering (VQA) targets at locating the right image regions regarding the answer prediction, offering a powerful technique to promote multi-modal understanding. However, recent studies have pointed out that the highlighted image regions from visual attention are often irrelevant to the given question and answer, leading the model to confused and incorrect visual reasoning. To tackle this problem, existing methods mostly resort to aligning the visual attention with human attention. Nevertheless, gathering such human data is laborious and expensive, making it burdensome to adapt well-developed models across datasets. To address this issue, in this paper, we devise a novel visual attention regularization approach, namely AttReg, for better visual grounding in VQA. Specifically, AttReg first identifies the image regions that are essential for the question yet are unexpectedly ignored (i.e., assigned low attention weights) by the backbone model. Then, a mask-guided learning scheme is leveraged to regularize the visual attention to focus more on these ignored key regions. The proposed method is very flexible and model-agnostic: it can be integrated into most visual-attention-based VQA models and requires no human attention supervision. Extensive experiments on three benchmark datasets, i.e., VQA-CP v2, VQA-CP v1, and VQA v2, have been conducted to evaluate the effectiveness of AttReg. As a by-product, when incorporating AttReg into the strong baseline LMH, our approach achieves a new state-of-the-art accuracy of 60.00%, an absolute performance gain of 7.01%, on the VQA-CP v2 benchmark dataset.
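A hypothetical sketch of the mask-guided regularization idea: penalize the case where a question-critical region receives low attention from the backbone. The hinge form of the penalty and the way critical regions are marked are assumptions, not the paper's exact loss:

```python
import numpy as np

def attention_regularizer(att, critical_mask, low_thresh=0.05):
    """Sketch of an AttReg-style penalty: for regions marked as critical for
    the question but assigned low attention by the backbone, add a hinge
    penalty that pushes their attention weights up.
    att:           (R,) attention weights over image regions (sums to 1)
    critical_mask: (R,) 1 for question-critical regions, 0 otherwise
    """
    ignored = (att < low_thresh) & (critical_mask == 1)  # critical yet ignored
    return np.sum(np.maximum(0.0, low_thresh - att) * ignored)

att = np.array([0.50, 0.30, 0.15, 0.03, 0.02])
critical = np.array([0, 0, 1, 1, 0])
print(attention_regularizer(att, critical))  # only the region at index 3 is penalized
```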
The development of social media user stance detection and bot detection methods relies heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, hindering graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain, together with user tweet features, as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
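As an illustration of the kind of feature selection mentioned above (keeping the property features with the greatest information gain), here is a small self-contained sketch for discrete features; the toy feature matrix and labels are placeholders, not MGTAB data:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    h_y_given_x = 0.0
    for v in np.unique(feature):
        mask = feature == v
        h_y_given_x += mask.mean() * entropy(labels[mask])
    return entropy(labels) - h_y_given_x

def top_k_features(X, y, k=20):
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(gains)[::-1][:k]  # indices of the k most informative features

# toy example: columns 0 and 2 are informative, column 1 is noise
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.stack([y, rng.integers(0, 2, size=200), 1 - y], axis=1)
print(top_k_features(X, y, k=2))
```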
The interview is regarded as one of the most crucial steps in recruitment. To fully prepare for the interview with recruiters, job seekers usually practice with mock interviews between each other. However, such a mock interview with peers is generally far from the real interview experience: the mock interviewers are not guaranteed to be professional and are not likely to behave like a real interviewer. Due to the rapid growth of online recruitment in recent years, recruiters tend to hold online interviews, which makes it possible to collect real interview data from real interviewers. In this paper, we propose a novel application named EZInterviewer, which aims to learn from online interview data and provide mock interview services to job seekers. The task is challenging in two ways: (1) the interview data are now available but still low-resource; (2) generating meaningful and relevant interview dialogs requires a thorough understanding of both resumes and job descriptions. To address the low-resource challenge, EZInterviewer is trained on a very small set of interview dialogs. The key idea is to reduce the number of parameters that rely on interview dialogs by disentangling the knowledge selector and the dialog generator, so that most parameters can be trained with ungrounded dialogs as well as resume data that are not low-resource. Evaluation results on a real-world job interview dialog dataset indicate that we achieve promising results in generating mock interviews. With the help of EZInterviewer, we hope to make mock interview practice easier for job seekers.
Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.
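A toy numerical sketch of the two ingredients the algorithm combines, PCA compression of high-frequency covariates followed by a single fitted-Q update; the linear Q-function, zero-weight bootstrap, dimensions, and simulated data are illustrative assumptions, not the paper's estimator:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (centered SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def fitted_q_step(S, A, R, S_next, gamma=0.9, n_actions=2):
    """One fitted-Q iteration with a linear Q(s, a) = phi(s, a) @ w."""
    def phi(states, actions):
        # action-specific copy of the state features (one block per action)
        feats = np.zeros((len(states), states.shape[1] * n_actions))
        for a in range(n_actions):
            idx = actions == a
            feats[idx, a * states.shape[1]:(a + 1) * states.shape[1]] = states[idx]
        return feats

    w = np.zeros(S.shape[1] * n_actions)       # current weights (zero-initialized here)
    q_next = np.stack([phi(S_next, np.full(len(S_next), a)) @ w
                       for a in range(n_actions)], axis=1)
    target = R + gamma * q_next.max(axis=1)    # bootstrap target r + gamma * max_a Q(s', a)
    w_new, *_ = np.linalg.lstsq(phi(S, A), target, rcond=None)
    return w_new

rng = np.random.default_rng(0)
high_freq = rng.normal(size=(100, 50))          # densely sampled covariates
state = np.hstack([rng.normal(size=(100, 3)),   # baseline / low-frequency part
                   pca_reduce(high_freq, 4)])   # compressed high-frequency part
actions = rng.integers(0, 2, size=100)
rewards = rng.normal(size=100)
print(fitted_q_step(state, actions, rewards, state).shape)  # (14,)
```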
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video given a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: the annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
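To make the boundary-bias issue concrete, here is a small hypothetical sketch of turning an exact annotated timestamp into a soft label over sparsely sampled frames; the Gaussian form and the sampling grid are assumptions, not SSRN's exact labeling scheme:

```python
import numpy as np

def soft_boundary_labels(sampled_times, boundary_time, sigma=1.0):
    """Soft label over sampled frames for one boundary (start or end):
    frames nearer the annotated timestamp receive higher scores, so the
    supervision degrades gracefully when downsampling drops the exact frame."""
    scores = np.exp(-((sampled_times - boundary_time) ** 2) / (2 * sigma ** 2))
    return scores / scores.sum()

# a 30 s video sparsely sampled every 2 s; annotated start at t = 7.3 s
sampled = np.arange(0, 30, 2.0)
labels = soft_boundary_labels(sampled, boundary_time=7.3)
print(sampled[labels.argmax()])  # 8.0: the nearest sampled frame gets the peak
```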
Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.
A storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards, however, remains challenging, as it not only requires association between high-level texts and images but also demands long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes that are manually selected from the corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence across long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models in creating text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work.
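A minimal sketch of the retrieval step underlying the task: ranking candidate keyframes by cosine similarity between synopsis and image embeddings. The embeddings here are random placeholders for whatever vision-and-language encoder is used, and the ordering/coherence modeling handled by the decoder is beyond this sketch:

```python
import numpy as np

def rank_keyframes(text_emb, image_embs, top_k=5):
    """Return indices of the top-k images by cosine similarity to the text."""
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ t
    return np.argsort(sims)[::-1][:top_k]

rng = np.random.default_rng(0)
synopsis_emb = rng.normal(size=512)            # placeholder text embedding
candidate_embs = rng.normal(size=(1000, 512))  # placeholder keyframe embeddings
print(rank_keyframes(synopsis_emb, candidate_embs))
```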
In the new era of personalization, learning the heterogeneous treatment effect (HTE) becomes an inevitable trend with numerous applications. Yet, most existing HTE estimation methods focus on independently and identically distributed observations and cannot handle the non-stationarity and temporal dependency in the common panel data setting. The treatment evaluators developed for panel data, on the other hand, typically ignore individualized information. To fill the gap, in this paper, we initiate the study of HTE estimation in panel data. Under different assumptions for HTE identifiability, we propose the corresponding heterogeneous one-side and two-side synthetic learners, namely H1SL and H2SL, by leveraging state-of-the-art HTE estimators for non-panel data and generalizing the synthetic control method to allow a flexible data generating process. We establish the convergence rates of the proposed estimators. The superior performance of the proposed methods over existing ones is demonstrated by extensive numerical studies.
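For intuition on the synthetic-control ingredient generalized here, a toy sketch that fits weights on control units to reproduce the treated unit's pre-treatment trajectory and then predicts its untreated post-treatment outcome. The plain least-squares fit and simulated data are simplifying assumptions; the classical method additionally constrains the weights to be non-negative and sum to one:

```python
import numpy as np

def synthetic_control_weights(treated_pre, controls_pre):
    """Fit weights w so the weighted combination of control units tracks the
    treated unit over the pre-treatment periods (unconstrained least squares;
    the classical method further requires w >= 0 and sum(w) = 1)."""
    w, *_ = np.linalg.lstsq(controls_pre, treated_pre, rcond=None)
    return w

def counterfactual_outcome(controls_post, w):
    """Predicted untreated outcome of the treated unit after treatment."""
    return controls_post @ w

rng = np.random.default_rng(0)
T_pre, T_post, n_controls = 20, 5, 8
controls_pre = rng.normal(size=(T_pre, n_controls))
true_w = rng.dirichlet(np.ones(n_controls))
treated_pre = controls_pre @ true_w + 0.01 * rng.normal(size=T_pre)
w = synthetic_control_weights(treated_pre, controls_pre)
controls_post = rng.normal(size=(T_post, n_controls))
print(counterfactual_outcome(controls_post, w))
```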